Bootstrap Hypothesis Testing (from scratch)#

A bootstrap hypothesis test is a resampling-based way to get a p-value when the sampling distribution of a statistic is hard (or undesirable) to derive analytically.

This notebook focuses on:

  • what bootstrap tests are used for (and what they are not for)

  • how to build a null distribution with bootstrap

  • how to compute and interpret a bootstrap p-value

  • a low-level implementation using only NumPy, plus Plotly visuals

import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

SEED_DATA = 42
SEED_BOOTSTRAP = 123

rng_data = np.random.default_rng(SEED_DATA)

Prerequisites#

  • You know what a null hypothesis \(H_0\) and alternative \(H_1\) are.

  • You know what a test statistic is (a number computed from data).

  • You can interpret a p-value at a basic level.

If any of that is fuzzy, keep going anyway: this notebook keeps the math light and emphasizes mechanics and interpretation.

1) What problem does a bootstrap test solve?#

In many tests (e.g. z-test, t-test), we can compute a p-value because we know (or approximate) the distribution of a statistic under \(H_0\).

But sometimes:

  • the statistic is complicated (median, trimmed mean, ratio, custom metric)

  • the usual distributional assumptions are uncomfortable (strong skew, heavy tails)

  • you want a method that matches the sampling process more directly

Bootstrap methods approximate sampling distributions by resampling the observed data.

Key idea:

  • Treat the observed sample as an estimate of the population.

  • Resample from it (with replacement) many times.

  • Recompute the statistic each time to get an empirical distribution.

A quick visual: bootstrap distribution of the sample mean#

Below, we draw one skewed sample and then bootstrap the mean many times to visualize the sampling distribution estimate.

sample = rng_data.exponential(scale=1.0, size=40)
observed_mean = float(np.mean(sample))

n_boot_demo = 5_000
rng_demo = np.random.default_rng(SEED_BOOTSTRAP)

idx = rng_demo.integers(0, sample.size, size=(n_boot_demo, sample.size))
bootstrap_means = sample[idx].mean(axis=1)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Observed sample", "Bootstrap distribution of the mean"),
)

fig.add_trace(go.Histogram(x=sample, nbinsx=30, name="sample"), row=1, col=1)
fig.add_trace(go.Histogram(x=bootstrap_means, nbinsx=40, name="boot means"), row=1, col=2)

fig.add_vline(
    x=observed_mean,
    line_width=3,
    line_color="crimson",
    annotation_text=f"observed mean = {observed_mean:.3f}",
    annotation_position="top",
    row=1,
    col=2,
)

fig.update_layout(
    height=380,
    width=950,
    showlegend=False,
    title_text="Bootstrap intuition: resample → recompute → distribution",
)
fig.update_xaxes(title_text="value", row=1, col=1)
fig.update_xaxes(title_text="bootstrap mean", row=1, col=2)
fig.update_yaxes(title_text="count", row=1, col=1)
fig.update_yaxes(title_text="count", row=1, col=2)
fig.show()

2) Turning bootstrap into a hypothesis test#

A hypothesis test needs the distribution of a statistic under the null hypothesis.

Let:

  • \(T(\text{data})\) be your test statistic (e.g., difference in means)

  • \(t_{\text{obs}} = T(\text{observed data})\)

A (two-sided) bootstrap p-value is estimated as:

\[\hat{p} = \frac{1 + \sum_{b=1}^{B} \mathbb{1}\left(|T_b^* - T_0| \ge |t_{\text{obs}} - T_0|\right)}{B + 1}\]

Where:

  • \(T_b^*\) are bootstrap statistics computed from null-consistent resamples

  • \(T_0\) is the null value of the statistic (often 0)

  • the +1 correction avoids reporting exactly 0 when \(B\) is finite
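As a tiny numeric illustration of the formula (with made-up values, not the experiment below):

```python
import numpy as np

# Hypothetical toy values: B = 9 null statistics, null value T0 = 0,
# observed statistic t_obs = 0.9.
null_stats = np.array([-0.8, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.9, 1.1])
t_obs = 0.9
t0 = 0.0

# Indicator: is each null statistic at least as far from T0 as t_obs is?
extreme = np.abs(null_stats - t0) >= abs(t_obs - t0)  # 2 of 9 are

p_hat = (extreme.sum() + 1) / (null_stats.size + 1)  # (2 + 1) / (9 + 1)
print(p_hat)  # → 0.3
```

Note that with the +1 correction the smallest reportable p-value is \(1/(B+1)\), which is one reason to use a reasonably large \(B\).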

Important subtlety:

  • If you bootstrap directly from the observed data, the bootstrap distribution is typically centered around the observed effect, not around \(H_0\).

  • For a test, we need to generate resamples from a null world where \(H_0\) is true.

How to build the null world depends on \(H_0\). A common (and intuitive) approach for mean-based tests is centering / shifting the samples so the null is satisfied, then bootstrapping.

Null-world construction for a two-sample mean test#

Suppose we have two independent groups:

  • Control sample: \(X_1, \dots, X_n\)

  • Treatment sample: \(Y_1, \dots, Y_m\)

We want to test: \(H_0: \mu_Y - \mu_X = \Delta_0\)

We can enforce \(H_0\) by shifting both samples so their means match what they should be under the null, while keeping their shapes:

  1. Compute the pooled mean \(\bar{Z}\) of all observations.

  2. Choose null means \((\mu_{X,0}, \mu_{Y,0})\) such that:

    • \(\mu_{Y,0} - \mu_{X,0} = \Delta_0\)

    • the pooled mean is preserved.

  3. Shift each group to those null means.

  4. Bootstrap-resample within each group and recompute the difference in means.

This yields an empirical null distribution for the statistic.

def bootstrap_diff_in_means(sample_a, sample_b, *, n_boot=20_000, rng):
    """Bootstrap distribution of mean(sample_b) - mean(sample_a)."""
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)

    n_a = sample_a.size
    n_b = sample_b.size

    idx_a = rng.integers(0, n_a, size=(n_boot, n_a))
    idx_b = rng.integers(0, n_b, size=(n_boot, n_b))

    boot_a = sample_a[idx_a].mean(axis=1)
    boot_b = sample_b[idx_b].mean(axis=1)

    return boot_b - boot_a


def bootstrap_null_diff_in_means(sample_a, sample_b, *, delta0=0.0, n_boot=20_000, rng):
    """Null bootstrap distribution for H0: mean(B) - mean(A) = delta0.

    Strategy: shift both groups so their means satisfy H0, then resample.
    """
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)

    n_a = sample_a.size
    n_b = sample_b.size

    pooled = np.concatenate([sample_a, sample_b])
    pooled_mean = float(np.mean(pooled))

    # Pick (mu_a0, mu_b0) that preserve pooled mean and enforce mu_b0 - mu_a0 = delta0.
    total = n_a + n_b
    mu_a0 = pooled_mean - (n_b / total) * delta0
    mu_b0 = pooled_mean + (n_a / total) * delta0

    sample_a_null = sample_a - np.mean(sample_a) + mu_a0
    sample_b_null = sample_b - np.mean(sample_b) + mu_b0

    idx_a = rng.integers(0, n_a, size=(n_boot, n_a))
    idx_b = rng.integers(0, n_b, size=(n_boot, n_b))

    boot_a = sample_a_null[idx_a].mean(axis=1)
    boot_b = sample_b_null[idx_b].mean(axis=1)

    return boot_b - boot_a


def bootstrap_p_value(null_stats, observed_stat, *, delta0=0.0, alternative="two-sided"):
    """Compute a bootstrap p-value given a null distribution of the statistic."""
    null_stats = np.asarray(null_stats, dtype=float)
    observed_stat = float(observed_stat)

    if alternative not in {"two-sided", "greater", "less"}:
        raise ValueError("alternative must be one of: 'two-sided', 'greater', 'less'")

    if alternative == "two-sided":
        extreme = np.abs(null_stats - delta0) >= abs(observed_stat - delta0)
    elif alternative == "greater":
        extreme = null_stats >= observed_stat
    else:  # "less"
        extreme = null_stats <= observed_stat

    # +1 correction: avoids returning exactly 0 for finite B.
    return (int(np.sum(extreme)) + 1) / (null_stats.size + 1)


def bootstrap_ci_percentile(bootstrap_stats, *, conf_level=0.95):
    """Percentile bootstrap confidence interval."""
    bootstrap_stats = np.asarray(bootstrap_stats, dtype=float)
    alpha = 1.0 - float(conf_level)
    low, high = np.quantile(bootstrap_stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

3) Worked example: two independent samples#

Imagine an experiment where we measure a time-to-event metric (often skewed), like “time on page” or “time until first click”.

We’ll simulate:

  • a control group

  • a treatment group with a positive location shift

Then we test: \(H_0: \mu_{\text{treatment}} - \mu_{\text{control}} = 0\)

control = rng_data.exponential(scale=1.0, size=50)
treatment = rng_data.exponential(scale=1.0, size=45) + 0.35

observed_diff = float(np.mean(treatment) - np.mean(control))
observed_diff
0.31111429489836495
fig = go.Figure()
fig.add_trace(
    go.Violin(
        y=control,
        name="Control",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
    )
)
fig.add_trace(
    go.Violin(
        y=treatment,
        name="Treatment",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
    )
)

fig.update_layout(
    title="Raw data (skewed): control vs treatment",
    yaxis_title="metric value",
    height=420,
    width=750,
)
fig.show()
n_boot = 30_000

null_stats = bootstrap_null_diff_in_means(
    control,
    treatment,
    delta0=0.0,
    n_boot=n_boot,
    rng=np.random.default_rng(SEED_BOOTSTRAP),
)

p_two_sided = bootstrap_p_value(null_stats, observed_diff, alternative="two-sided")
p_greater = bootstrap_p_value(null_stats, observed_diff, alternative="greater")

print(f"Observed mean difference (treatment - control): {observed_diff:.4f}")
print(f"Bootstrap p-value (two-sided): {p_two_sided:.4f}")
print(f"Bootstrap p-value (greater, i.e. treatment > control): {p_greater:.4f}")
Observed mean difference (treatment - control): 0.3111
Bootstrap p-value (two-sided): 0.0867
Bootstrap p-value (greater, i.e. treatment > control): 0.0404
extreme = np.abs(null_stats) >= abs(observed_diff)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=null_stats[~extreme],
        nbinsx=70,
        name="null distribution",
        marker_color="#4C78A8",
        opacity=0.85,
    )
)
fig.add_trace(
    go.Histogram(
        x=null_stats[extreme],
        nbinsx=70,
        name="as/extreme as observed",
        marker_color="#E45756",
        opacity=0.95,
    )
)

fig.add_vline(
    x=observed_diff,
    line_width=3,
    line_color="crimson",
    annotation_text=f"observed = {observed_diff:.3f}",
    annotation_position="top",
)

fig.update_layout(
    barmode="overlay",
    title=f"Null distribution via bootstrap (two-sided p ≈ {p_two_sided:.4f})",
    xaxis_title="mean difference under H0",
    yaxis_title="count",
    height=430,
    width=900,
)
fig.show()

How to interpret the p-value (what it means exactly)#

A p-value answers this question:

If the null hypothesis were true, how often would we see a statistic at least as extreme as the one we observed?

In this notebook, that probability is approximated by the fraction of null bootstrap resamples whose mean difference is at least as extreme as the observed mean difference.

Crucial interpretation points:

  • The p-value is not \(P(H_0 \mid \text{data})\) (it is not “the probability the null is true”).

  • A small p-value means the observed result would be rare under \(H_0\), so it is evidence against \(H_0\).

  • A large p-value means the result is not surprising under \(H_0\); it does not prove \(H_0\).

  • The p-value does not tell you whether the effect is important; it is influenced by sample size.

4) Effect size + uncertainty (bootstrap confidence interval)#

Hypothesis tests answer “is this surprising under \(H_0\)?”. Often you also want:

  • the estimated effect size (here: mean difference)

  • uncertainty around that effect (e.g. a 95% CI)

A common companion is a percentile bootstrap confidence interval from the standard bootstrap distribution (bootstrapping directly from the observed groups).

boot_stats = bootstrap_diff_in_means(
    control,
    treatment,
    n_boot=n_boot,
    rng=np.random.default_rng(SEED_BOOTSTRAP),
)

ci_low, ci_high = bootstrap_ci_percentile(boot_stats, conf_level=0.95)
boot_se = float(np.std(boot_stats, ddof=1))

print(f"Observed mean difference: {observed_diff:.4f}")
print(f"Bootstrap SE (approx): {boot_se:.4f}")
print(f"95% percentile bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
Observed mean difference: 0.3111
Bootstrap SE (approx): 0.1816
95% percentile bootstrap CI: [-0.0519, 0.6593]
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=boot_stats,
        nbinsx=70,
        name="bootstrap effect distribution",
        marker_color="#4C78A8",
        opacity=0.9,
    )
)

fig.add_vline(
    x=0.0,
    line_width=2,
    line_color="black",
    line_dash="dash",
    annotation_text="null (0)",
    annotation_position="top",
)
fig.add_vline(
    x=observed_diff,
    line_width=3,
    line_color="crimson",
    annotation_text=f"observed = {observed_diff:.3f}",
    annotation_position="top",
)
fig.add_vline(
    x=ci_low,
    line_width=3,
    line_color="#72B7B2",
    annotation_text=f"2.5% = {ci_low:.3f}",
    annotation_position="bottom",
)
fig.add_vline(
    x=ci_high,
    line_width=3,
    line_color="#72B7B2",
    annotation_text=f"97.5% = {ci_high:.3f}",
    annotation_position="bottom",
)

fig.update_layout(
    title="Standard bootstrap distribution for the effect + 95% CI",
    xaxis_title="mean difference (treatment - control)",
    yaxis_title="count",
    height=430,
    width=900,
)
fig.show()

Interpreting the CI#

A 95% confidence interval can be read as:

If we repeated the entire data-collection process many times and built the interval each time, about 95% of those intervals would contain the true effect.

For the common “test via CI inversion” rule:

  • For a two-sided test at \(\alpha=0.05\), you reject \(H_0\) if 0 is not inside the 95% CI.

Compared to a p-value, the CI forces you to look at magnitude and uncertainty, not just a threshold.
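The inversion rule itself is one line. As a minimal sketch, plugging in the interval printed above:

```python
# Test-by-CI-inversion for H0: mean difference = 0 at alpha = 0.05.
# Reject iff the 95% CI excludes 0. Numbers copied from the CI above.
ci_low, ci_high = -0.0519, 0.6593

reject = not (ci_low <= 0.0 <= ci_high)
print(reject)  # → False: 0 is inside the interval
```

Here the CI does not exclude 0, which is consistent with the two-sided p-value of about 0.087 found earlier.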

5) When (and how) to use bootstrap tests in practice#

Bootstrap tests are most useful when you:

  • care about a statistic without a clean analytic null distribution

  • want fewer distributional assumptions than classic parametric tests

  • are willing to pay compute to get a flexible, data-driven approximation

But you must match the resampling scheme to the data structure:

  • One-sample location test (\(H_0: \mu = \mu_0\)): shift the sample so its mean equals \(\mu_0\), then bootstrap.

  • Two-sample independent groups: resample within each group (as done here).

  • Paired / matched data: resample the pairs (or resample the paired differences).

  • Regression: consider a residual bootstrap or pairs bootstrap.

  • Time series / dependence: use block bootstrap variants (simple i.i.d. resampling breaks dependence).
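For instance, the one-sample case in the first bullet uses the same shift-then-resample recipe as the two-sample code above. A minimal sketch (the function name and defaults here are illustrative, not a standard API):

```python
import numpy as np


def bootstrap_one_sample_mean_test(sample, mu0, *, n_boot=10_000, seed=0):
    """Null bootstrap p-value (two-sided) for H0: mean = mu0 (sketch)."""
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)

    # Enforce H0 by shifting the sample so its mean is exactly mu0,
    # keeping the sample's shape intact.
    sample_null = sample - sample.mean() + mu0

    # Resample with replacement and recompute the mean each time.
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    null_means = sample_null[idx].mean(axis=1)

    # Two-sided p-value with the +1 correction, as before.
    t_obs = sample.mean()
    extreme = np.abs(null_means - mu0) >= abs(t_obs - mu0)
    return (int(extreme.sum()) + 1) / (n_boot + 1)
```

The paired case reduces to this same function applied to the vector of paired differences with \(\mu_0 = 0\).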

Also note:

  • For “are two groups different?” with exchangeable observations, a permutation test is often preferred for hypothesis testing (exact under \(H_0\)), while the bootstrap shines for confidence intervals.
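For contrast, here is a minimal permutation-test sketch for the same two-sample mean difference (again illustrative, not a library API):

```python
import numpy as np


def permutation_p_value(sample_a, sample_b, *, n_perm=10_000, seed=0):
    """Two-sided permutation test for H0: mean(B) - mean(A) = 0 (sketch).

    Under H0 with exchangeable observations, group labels are arbitrary,
    so we shuffle the pooled data and re-split it into the original sizes.
    """
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    rng = np.random.default_rng(seed)

    pooled = np.concatenate([a, b])
    t_obs = b.mean() - a.mean()

    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        t_perm = perm[a.size:].mean() - perm[:a.size].mean()
        if abs(t_perm) >= abs(t_obs):
            count += 1

    # Same +1 correction as the bootstrap p-value.
    return (count + 1) / (n_perm + 1)
```

The key difference: the permutation test pools and relabels the data (no resampling with replacement), which makes it exact under exchangeability, while the bootstrap resamples within each (shifted) group.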

6) Pitfalls#

  • Garbage in, garbage out: bootstrap assumes your sample represents the population well.

  • Dependence matters: naive bootstrap on correlated data can be very misleading.

  • Small samples: bootstrap approximations can be noisy or biased.

  • Null-world choice matters: different ways of enforcing \(H_0\) can lead to different results.

  • Report more than p-values: always include effect size + uncertainty.

Exercises#

  1. Change the statistic from mean to median and repeat the workflow.

  2. Increase/decrease the sample size and see how the p-value and CI change.

  3. Implement a paired bootstrap test by simulating paired differences.

  4. Compare a bootstrap p-value to a permutation-test p-value for the same data.

References#

  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife.

  • Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application.

  • Good, P. (2005). Permutation, Parametric and Bootstrap Tests of Hypotheses.